February 21, 2015

Learning goals

  • When to use the parallel package in R
  • How to parallelize apply functions
  • How to parallelize a process that generates random numbers

Why?

Why parallel?

  • Use your whole computer for embarrassingly parallel tasks
  • Large gain in efficiency for a small amount of effort

Why parallel?

The parallel package merges multicore and snow

  • Ships with recent versions of R
  • Integrated handling of random-number generation
  • By default, uses multicore functionality on Unix-like systems and snow functionality on Windows
  • (Can also use the snow functionality to execute on a cluster for both systems)

Getting started

Setup

Make the desired number of cores available to R:

require(parallel)

# Determine number of CPU cores in your machine
nCores = detectCores()

# Create cluster with desired number of cores
cl = makeCluster(nCores)
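On a shared or interactive machine, a common variation (an assumption on my part, not from the slides) is to leave one core free so the operating system stays responsive:

```r
require(parallel)

# Leave one core free for the OS; guard against single-core machines
nCores = max(1, detectCores() - 1)
cl = makeCluster(nCores)
```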

[Screenshot: MenuMeters CPU monitor showing usage across cores]

Parallelizing apply

  • foreach parallelizes for loops
  • parallel does the same for apply functions
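Each serial apply function has a cluster counterpart that takes the cluster as its first argument. A minimal sketch (using a fresh two-worker cluster so the example is self-contained):

```r
require(parallel)
cl = makeCluster(2)

# sapply -> parSapply (simplified vector result)
v = parSapply(cl, 1:3, sqrt)
# lapply -> parLapply (list result)
l = parLapply(cl, 1:3, sqrt)
# apply  -> parApply (here: row sums of a 2x2 matrix)
m = parApply(cl, matrix(1:4, nrow = 2), 1, sum)

stopCluster(cl)
```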

Parallelizing apply

input = 1:3
sapply(input, sqrt)
## [1] 1.000000 1.414214 1.732051
lapply(input, sqrt)
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 1.414214
## 
## [[3]]
## [1] 1.732051

Parallelizing apply

Consider this simple example:

input = 1:100

testFun = function(i){
  mu = mean(runif(1e+06, 0, i))
  return(mu)
}

(There is something wrong here… we'll come back to this.)

Parallelizing apply

How would we do this with foreach?

# %dopar% needs a registered parallel backend, e.g. via doParallel
require(foreach); require(doParallel)
registerDoParallel(cl)

system.time(
  mu.foreach <- foreach(i = 1:100,
                        .combine = "c") %dopar% {
                          testFun(i)
                        }
)

Parallelizing apply

Now let's try it with sapply and parSapply

system.time(
  sapply(input, testFun)
  )
##    user  system elapsed 
##   3.790   0.112   3.911
system.time(
  parSapply(cl, input, testFun)
  )
##    user  system elapsed 
##   0.003   0.000   1.940

Parallelizing apply

Now let's try it with lapply and parLapply

system.time(
  lapply(input, testFun)
  )
##    user  system elapsed 
##   5.809   0.112   5.931
system.time(
  parLapply(cl, input, testFun)
  )
##    user  system elapsed 
##   0.001   0.000   1.900

Parallelizing apply

We could also use mclapply

# not available on Windows
system.time(
  mclapply(input, testFun, mc.cores=nCores)
  )
##    user  system elapsed 
##   5.934   0.208   2.268

clusterExport

base = 2
clusterExport(cl, "base")
parLapply(cl, 1:3, function(x){base^x})
## [[1]]
## [1] 2
## 
## [[2]]
## [1] 4
## 
## [[3]]
## [1] 8
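clusterExport copies objects to the workers; to run arbitrary setup code on every worker, such as loading a package, clusterEvalQ can be used (a minimal sketch):

```r
require(parallel)
cl = makeCluster(2)

# Run setup code on every worker; here, load the splines package
clusterEvalQ(cl, library(splines))

# Workers can now use bs(); with no knots, a degree-d basis has d columns
res = parSapply(cl, 1:3, function(d) ncol(bs(1:10, degree = d)))

stopCluster(cl)
```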

Random number generation

The issue

  • Need to be careful when parallelizing a process that generates (pseudo-) random numbers
  • Want the different processes to use independent (and reproducible) random-number streams
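To see the issue concretely: seeding only the master process does not make cluster results reproducible, since each worker draws from its own uncontrolled stream (a sketch; the two runs will almost surely differ):

```r
require(parallel)
cl = makeCluster(2)

set.seed(2015)  # seeds the master only, not the workers
r1 = parSapply(cl, 1:4, function(i) runif(1))
set.seed(2015)
r2 = parSapply(cl, 1:4, function(i) runif(1))

identical(r1, r2)  # almost surely FALSE

stopCluster(cl)
```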

clusterSetRNGStream

Let's go back and fix our first example:

input = 1:100
testFun = function(i){ mean(runif(1e+06, 0, i)) }
clusterSetRNGStream(cl, iseed=2015)
res1 = parSapply(cl, input, testFun)

clusterSetRNGStream(cl, iseed=2015)
res2 = parSapply(cl, input, testFun)
identical(res1, res2)
## [1] TRUE

Example: bootstrapping

Bootstrapping

  • An embarrassingly parallel task
  • Resamples the data – uses random number generation

Bootstrapping

  • Predict number of arrivals at 5pm on a weekday for the NOHO station
  • Point estimate: 23.3 arrivals
  • Create a 95% bootstrap confidence interval for this prediction
[Figure: map of station locations]

Bootstrapping

run1 = function(...){
  require(boot); require(splines)
  load(url("http://www.stat.colostate.edu/~scharfh/CSP_parallel/data/arrivals_subset.RData"))
  bikePred = function(data, indices){
    d = data[indices,]
    big.glm <- glm(arrivals ~
                     bs(hour, degree = 4)*weekend
                   + bs(hour, degree = 4)*as.factor(id),
                   data = d, family = "poisson")
    mynewdat = data.frame(weekend=FALSE, id=293, hour=17)
    return(predict(object=big.glm, newdata=mynewdat, type="response"))
  }
  boot(data=arrivals.sub, statistic=bikePred, R=250)
}

Bootstrapping

set.seed(2015)
system.time(
  bike.boot <- do.call(c, lapply(seq_len(nCores), run1))
  )
##    user  system elapsed 
## 170.026   1.799 173.381
clusterSetRNGStream(cl, iseed=2015)
system.time(
  bike.boot2 <- do.call(c, parLapply(cl, seq_len(nCores), run1))
  )
##    user  system elapsed 
##   0.007   0.002  70.793

Bootstrapping

[Figure: histogram of the bootstrap replicates]

Bootstrapping

boot.ci(bike.boot2, type="perc")
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = bike.boot2, type = "perc")
## 
## Intervals : 
## Level     Percentile     
## 95%   (21.54, 24.83 )  
## Calculations and Intervals on Original Scale

Since the boot package has built-in parallel support, we could also simply run the following:

# assumes arrivals.sub and bikePred are defined in the current session
nBoot = nCores*250
set.seed(2015, kind = "L'Ecuyer-CMRG")
system.time(
  bike.boot3 <- boot(data=arrivals.sub, statistic=bikePred, R=nBoot, 
                    parallel="multicore", ncpus=nCores)
)
)
##    user  system elapsed 
## 181.781   2.518  64.943

Stop cluster

Don't forget to stop your cluster when you are done using it.

stopCluster(cl)
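One defensive pattern (a sketch, not from the slides) is to create and stop the cluster inside a function, using on.exit() so the workers are released even if an error occurs partway through:

```r
require(parallel)

parMeans = function(input, nCores = 2){
  cl = makeCluster(nCores)
  on.exit(stopCluster(cl))  # runs on normal exit and on error
  parSapply(cl, input, mean)
}

res = parMeans(list(1:10, 1:100))
```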

Further Reading